{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9d59582a-6473-4b34-929b-3e94cb443c3d",
   "metadata": {},
   "source": [
    "# How to add scores to retriever results\n",
    "\n",
    "Retrievers return sequences of [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects, which by default carry no information about the process that retrieved them (e.g., a similarity score against the query). Here we demonstrate how to add retrieval scores to the `.metadata` of documents:\n",
    "1. From [vectorstore retrievers](/docs/how_to/vectorstore_retriever);\n",
    "2. From higher-order LangChain retrievers, such as [SelfQueryRetriever](/docs/how_to/self_query) or [MultiVectorRetriever](/docs/how_to/multi_vector).\n",
    "\n",
    "For (1), we will implement a short wrapper function around the corresponding vector store. For (2), we will subclass the corresponding retriever and override one of its methods.\n",
    "\n",
    "## Create vector store\n",
    "\n",
    "First we populate a vector store with some data. We will use a [PineconeVectorStore](https://api.python.langchain.com/en/latest/vectorstores/langchain_pinecone.vectorstores.PineconeVectorStore.html), but this guide is compatible with any LangChain vector store that implements a `.similarity_search_with_score` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "b8cfcb1b-64ee-4b91-8d82-ce7803834985",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.documents import Document\n",
    "from langchain_openai import OpenAIEmbeddings\n",
    "from langchain_pinecone import PineconeVectorStore\n",
    "\n",
    "docs = [\n",
    "    Document(\n",
    "        page_content=\"A bunch of scientists bring back dinosaurs and mayhem breaks loose\",\n",
    "        metadata={\"year\": 1993, \"rating\": 7.7, \"genre\": \"science fiction\"},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Leo DiCaprio gets lost in a dream within a dream within a dream within a ...\",\n",
    "        metadata={\"year\": 2010, \"director\": \"Christopher Nolan\", \"rating\": 8.2},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea\",\n",
    "        metadata={\"year\": 2006, \"director\": \"Satoshi Kon\", \"rating\": 8.6},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"A bunch of normal-sized women are supremely wholesome and some men pine after them\",\n",
    "        metadata={\"year\": 2019, \"director\": \"Greta Gerwig\", \"rating\": 8.3},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Toys come alive and have a blast doing so\",\n",
    "        metadata={\"year\": 1995, \"genre\": \"animated\"},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Three men walk into the Zone, three men walk out of the Zone\",\n",
    "        metadata={\n",
    "            \"year\": 1979,\n",
    "            \"director\": \"Andrei Tarkovsky\",\n",
    "            \"genre\": \"thriller\",\n",
    "            \"rating\": 9.9,\n",
    "        },\n",
    "    ),\n",
    "]\n",
    "\n",
    "vectorstore = PineconeVectorStore.from_documents(\n",
    "    docs, index_name=\"sample\", embedding=OpenAIEmbeddings()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22ac5ef6-ce18-427f-a91c-62b38a8b41e9",
   "metadata": {},
   "source": [
    "## Retriever\n",
    "\n",
    "To obtain scores from a vector store retriever, we wrap the underlying vector store's `.similarity_search_with_score` method in a short function that packages scores into the associated document's metadata.\n",
    "\n",
    "We add a `@chain` decorator to the function to create a [Runnable](/docs/concepts/#langchain-expression-language) that can be used similarly to a typical retriever."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "7e5677c3-f6ee-4974-ab5f-a0f50c199d45",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List\n",
    "\n",
    "from langchain_core.documents import Document\n",
    "from langchain_core.runnables import chain\n",
    "\n",
    "\n",
    "@chain\n",
    "def retriever(query: str) -> List[Document]:\n",
    "    # Each result is a (document, score) pair; looping directly also handles empty results\n",
    "    docs = []\n",
    "    for doc, score in vectorstore.similarity_search_with_score(query):\n",
    "        doc.metadata[\"score\"] = score\n",
    "        docs.append(doc)\n",
    "\n",
    "    return docs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "c9cad75e-b955-4012-989c-3c1820b49ba9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993.0, 'score': 0.84429127}),\n",
       " Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995.0, 'score': 0.792038262}),\n",
       " Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979.0, 'score': 0.751571238}),\n",
       " Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006.0, 'score': 0.747471571})]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = retriever.invoke(\"dinosaur\")\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6671308a-be8d-4c15-ae1f-5bd07b342560",
   "metadata": {},
   "source": [
    "Note that similarity scores from the retrieval step are included in the metadata of the above documents."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af2e73a0-46a1-47e2-8103-68aaa637642a",
   "metadata": {},
   "source": [
    "## SelfQueryRetriever\n",
    "\n",
    "`SelfQueryRetriever` uses an LLM to generate a query that may be structured: for example, it can construct filters for the retrieval on top of the usual semantic-similarity-driven selection. See [this guide](/docs/how_to/self_query) for more detail.\n",
    "\n",
    "`SelfQueryRetriever` includes a short (1-2 line) method, `_get_docs_with_query`, that executes the `vectorstore` search. We can subclass `SelfQueryRetriever` and override this method to propagate similarity scores.\n",
    "\n",
    "First, following the [how-to guide](/docs/how_to/self_query), we will need to establish some metadata on which to filter:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8280b829-2e81-4454-8adc-9a0930047fa2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chains.query_constructor.base import AttributeInfo\n",
    "from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "metadata_field_info = [\n",
    "    AttributeInfo(\n",
    "        name=\"genre\",\n",
    "        description=\"The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']\",\n",
    "        type=\"string\",\n",
    "    ),\n",
    "    AttributeInfo(\n",
    "        name=\"year\",\n",
    "        description=\"The year the movie was released\",\n",
    "        type=\"integer\",\n",
    "    ),\n",
    "    AttributeInfo(\n",
    "        name=\"director\",\n",
    "        description=\"The name of the movie director\",\n",
    "        type=\"string\",\n",
    "    ),\n",
    "    AttributeInfo(\n",
    "        name=\"rating\", description=\"A 1-10 rating for the movie\", type=\"float\"\n",
    "    ),\n",
    "]\n",
    "document_content_description = \"Brief summary of a movie\"\n",
    "llm = ChatOpenAI(temperature=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a6c6fa8-1e2f-45ee-83e9-a6cbd82292d2",
   "metadata": {},
   "source": [
    "We then override the `_get_docs_with_query` method to use the `similarity_search_with_score` method of the underlying vector store:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "62c8f3fa-8b64-4afb-87c4-ccbbf9a8bc54",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Any, Dict\n",
    "\n",
    "\n",
    "class CustomSelfQueryRetriever(SelfQueryRetriever):\n",
    "    def _get_docs_with_query(\n",
    "        self, query: str, search_kwargs: Dict[str, Any]\n",
    "    ) -> List[Document]:\n",
    "        \"\"\"Get docs, adding score information.\"\"\"\n",
    "        # Use the retriever's own vector store rather than the global variable\n",
    "        results = self.vectorstore.similarity_search_with_score(\n",
    "            query, **search_kwargs\n",
    "        )\n",
    "        docs = []\n",
    "        for doc, score in results:\n",
    "            doc.metadata[\"score\"] = score\n",
    "            docs.append(doc)\n",
    "\n",
    "        return docs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56e40109-1db6-44c7-a6e6-6989175e267c",
   "metadata": {},
   "source": [
    "Invoking this retriever now returns documents with similarity scores in their metadata. Note that the underlying structured-query capabilities of `SelfQueryRetriever` are retained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "3359a1ee-34ff-41b6-bded-64c05785b333",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993.0, 'score': 0.84429127})]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "retriever = CustomSelfQueryRetriever.from_llm(\n",
    "    llm,\n",
    "    vectorstore,\n",
    "    document_content_description,\n",
    "    metadata_field_info,\n",
    ")\n",
    "\n",
    "\n",
    "result = retriever.invoke(\"dinosaur movie with rating less than 8\")\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "689ab3ba-3494-448b-836e-05fbe1ffd51c",
   "metadata": {},
   "source": [
    "## MultiVectorRetriever\n",
    "\n",
    "`MultiVectorRetriever` allows you to associate multiple vectors with a single document. This can be useful in a number of applications. For example, we can index small chunks of a larger document and run the retrieval on the chunks, but return the larger \"parent\" document when invoking the retriever. [ParentDocumentRetriever](/docs/how_to/parent_document_retriever/), a subclass of `MultiVectorRetriever`, includes convenience methods for populating a vector store to support this. Further applications are detailed in this [how-to guide](/docs/how_to/multi_vector/).\n",
    "\n",
    "To propagate similarity scores through this retriever, we can again subclass `MultiVectorRetriever` and override a method. This time we will override `_get_relevant_documents`.\n",
    "\n",
    "First, we prepare some fake data. We generate fake \"whole documents\" and store them in a document store; here we will use a simple [InMemoryStore](https://api.python.langchain.com/en/latest/stores/langchain_core.stores.InMemoryBaseStore.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "a112e545-7b53-4fcd-9c4a-7a42a5cc646d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.storage import InMemoryStore\n",
    "\n",
    "# The storage layer for the parent documents\n",
    "docstore = InMemoryStore()\n",
    "fake_whole_documents = [\n",
    "    (\"fake_id_1\", Document(page_content=\"fake whole document 1\")),\n",
    "    (\"fake_id_2\", Document(page_content=\"fake whole document 2\")),\n",
    "]\n",
    "docstore.mset(fake_whole_documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "453b7415-4a6d-45d4-a329-9c1d7271d1b2",
   "metadata": {},
   "source": [
    "Next we add some fake \"sub-documents\" to our vector store. We link each sub-document to its parent document by setting the `\"doc_id\"` key in the sub-document's metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "314519c0-dde4-41ea-a1ab-d3cf1c17c63f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['62a85353-41ff-4346-bff7-be6c8ec2ed89',\n",
       " '5d4a0e83-4cc5-40f1-bc73-ed9cbad0ee15',\n",
       " '8c1d9a56-120f-45e4-ba70-a19cd19a38f4']"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs = [\n",
    "    Document(\n",
    "        page_content=\"A snippet from a larger document discussing cats.\",\n",
    "        metadata={\"doc_id\": \"fake_id_1\"},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"A snippet from a larger document discussing discourse.\",\n",
    "        metadata={\"doc_id\": \"fake_id_1\"},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"A snippet from a larger document discussing chocolate.\",\n",
    "        metadata={\"doc_id\": \"fake_id_2\"},\n",
    "    ),\n",
    "]\n",
    "\n",
    "vectorstore.add_documents(docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e391f7f3-5a58-40fd-89fa-a0815c5146f7",
   "metadata": {},
   "source": [
    "To propagate the scores, we subclass `MultiVectorRetriever` and override its `_get_relevant_documents` method. Here we will make two changes:\n",
    "\n",
    "1. We will add similarity scores to the metadata of the corresponding \"sub-documents\" using the `similarity_search_with_score` method of the underlying vector store as above;\n",
    "2. We will include a list of these sub-documents in the metadata of the retrieved parent document. This surfaces which snippets of text the retrieval matched, together with their similarity scores."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "1de61de7-1b58-41d6-9dea-939fef7d741d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "\n",
    "from langchain.retrievers import MultiVectorRetriever\n",
    "from langchain_core.callbacks import CallbackManagerForRetrieverRun\n",
    "\n",
    "\n",
    "class CustomMultiVectorRetriever(MultiVectorRetriever):\n",
    "    def _get_relevant_documents(\n",
    "        self, query: str, *, run_manager: CallbackManagerForRetrieverRun\n",
    "    ) -> List[Document]:\n",
    "        \"\"\"Get documents relevant to a query.\n",
    "        Args:\n",
    "            query: String to find relevant documents for\n",
    "            run_manager: The callbacks handler to use\n",
    "        Returns:\n",
    "            List of relevant documents\n",
    "        \"\"\"\n",
    "        results = self.vectorstore.similarity_search_with_score(\n",
    "            query, **self.search_kwargs\n",
    "        )\n",
    "\n",
    "        # Map doc_ids to list of sub-documents, adding scores to metadata\n",
    "        id_to_doc = defaultdict(list)\n",
    "        for doc, score in results:\n",
    "            doc_id = doc.metadata.get(\"doc_id\")\n",
    "            if doc_id:\n",
    "                doc.metadata[\"score\"] = score\n",
    "                id_to_doc[doc_id].append(doc)\n",
    "\n",
    "        # Fetch documents corresponding to doc_ids, retaining sub_docs in metadata\n",
    "        docs = []\n",
    "        for _id, sub_docs in id_to_doc.items():\n",
    "            docstore_docs = self.docstore.mget([_id])\n",
    "            if docstore_docs:\n",
    "                if doc := docstore_docs[0]:\n",
    "                    doc.metadata[\"sub_docs\"] = sub_docs\n",
    "                    docs.append(doc)\n",
    "\n",
    "        return docs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7af27b38-631c-463f-9d66-bcc985f06a4f",
   "metadata": {},
   "source": [
    "Invoking this retriever, we can see that it returns the correct parent document, with the matching sub-document snippet and its similarity score included under the `sub_docs` key of the parent's metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "dc42a1be-22e1-4ade-b1bd-bafb85f2424f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='fake whole document 1', metadata={'sub_docs': [Document(page_content='A snippet from a larger document discussing cats.', metadata={'doc_id': 'fake_id_1', 'score': 0.831276655})]})]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "retriever = CustomMultiVectorRetriever(vectorstore=vectorstore, docstore=docstore)\n",
    "\n",
    "retriever.invoke(\"cat\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
